EE0005 Mini Project

Topic: Cardiovascular Disease Analysis and Prediction
Team Members: Zhang Haoran, Huang Menghui, Sun Qiyang, Tham Yong Hao
Tutorial Group: EE08
Professor: Chen Lihui

Outline

  1. Introduction
  2. New Knowledge We Have Learned
  3. Contribution
  4. Importing Libraries
  5. Data Cleaning and Exploratory Data Analysis
  6. Feature Engineering
  7. Visualization
  8. Data Merge
  9. Model Training and Testing:
    • Decision Tree Classifier
    • Random Forest Classifier
    • XGBoost
    • K-Nearest Neighbors
    • Support Vector Machine
    • Neural Network
    • Model Ensembling - Blending
  10. Real-time Model Prediction (Our application)
  11. Conclusion

1. Introduction

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year. Early prediction and prevention of CVDs are vital for improving survival rates. We therefore analyse a cardiovascular-disease dataset consisting of 70000 records collected at the moment of medical examination, and fit them into different machine learning models. In the end, we develop an application for real-time prediction based on what we learned from this dataset.

2. New Knowledge We Have Learned

  1. Ways (interpolation and log transformation) to deal with incorrect data instead of simply dropping it.
  2. Feature engineering (creating new features based on the data we have and on information from CVD-related scientific articles, data standardization, data merging, resampling).
  3. More ways to draw plots with Plotly.
  4. New models: Random Forest, XGBoost, Support Vector Machine, Neural Network, Model Ensembling (Blending).
  5. Setting up an application to predict the probability that a person will get a CVD.

3. Contribution

Remarks:

4. Importing Libraries
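A typical import cell for this notebook might look like the sketch below; the exact library list is an assumption based on the tools referenced later (pandas and NumPy for data handling, scikit-learn for the models, with Plotly used separately for the plots):

```python
# Core data-handling and modelling libraries used throughout the notebook.
import numpy as np
import pandas as pd
import sklearn

# Quick sanity check that the environment is set up.
print(pd.__version__, np.__version__, sklearn.__version__)
```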

5. Exploratory Data Analysis and Data Cleaning

Source of dataset: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

Data description: there are 3 types of input features:

  • Objective: factual information
  • Examination: results of medical examination
  • Subjective: information given by the patient

Features:

  1. Age | Objective Feature | age | int (days)
  2. Height | Objective Feature | height | int (cm)
  3. Weight | Objective Feature | weight | float (kg)
  4. Gender | Objective Feature | gender | categorical code | 1: women, 2: men
  5. Systolic blood pressure | Examination Feature | ap_hi | int
  6. Diastolic blood pressure | Examination Feature | ap_lo | int
  7. Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
  8. Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal
  9. Smoking | Subjective Feature | smoke | binary
  10. Alcohol intake | Subjective Feature | alco | binary
  11. Physical activity | Subjective Feature | active | binary
  12. Presence or absence of cardiovascular disease | Target Variable | cardio | binary

Overall Statistical Description

Checking for Null Values

Checking for Dataset Imbalance

This indicates that the counts of '0' and '1' are in a ratio of 35021/34979 = 1.0012, which is almost 1, so the dataset is balanced.
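The balance check can be sketched as follows; the toy series stands in for the real df['cardio'] column, whose actual counts are 35021 zeros and 34979 ones:

```python
import pandas as pd

# Toy stand-in for df['cardio']; the real column has 70 000 entries.
cardio = pd.Series([0, 1, 0, 1, 1, 0])
counts = cardio.value_counts()
ratio = counts[0] / counts[1]
print(counts.to_dict(), ratio)
```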

Data Cleaning and Data Preparation

Number of datapoints removed is 70000 - 69976 = 24
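If those removals came from dropping exact duplicate rows (an assumption; the notebook's actual removal criterion is in its code), the step would look like this on a toy frame:

```python
import pandas as pd

# Toy frame with one exact duplicate row to illustrate the clean-up step.
df = pd.DataFrame({"height": [165, 170, 165], "weight": [62.0, 80.0, 62.0]})
before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(before - len(df))  # number of datapoints removed
```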

Dealing with illogical ap_hi and ap_lo values

Dealing with Outliers

We can see there are thousands of outliers. As the number of outliers for the height and weight features is still relatively small compared to ap_hi and ap_lo, we will only use this method to clean up the outliers for height and weight, to avoid too much data loss.

We will now clean up outliers for ap_hi and ap_lo with data from our research.

Feature Engineering
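The derived features named later in the notebook (year, bmi, pp, map) can be sketched as follows; the mean-arterial-pressure formula MAP = DBP + PP/3 is a standard approximation, and the single toy row is illustrative:

```python
import pandas as pd

# One toy record: age in days, height in cm, weight in kg, blood pressures.
df = pd.DataFrame({"age": [18250], "height": [170], "weight": [68.0],
                   "ap_hi": [120], "ap_lo": [80]})

df["year"] = (df["age"] / 365).round().astype(int)        # age in years
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2      # body mass index
df["pp"] = df["ap_hi"] - df["ap_lo"]                      # pulse pressure
df["map"] = df["ap_lo"] + df["pp"] / 3                    # mean arterial pressure
print(df[["year", "bmi", "pp", "map"]].round(2).to_dict("records"))
```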

Visualization

Separating categorical variables from numerical variables

Standard Statistical Distributions for Numerical Variables

Distribution of Age Variable

We can see that the risk of getting cardiovascular disease increases with age (year).

Creating Mutual Box Plots between Numerical Variables and Presence of Cardiovascular Disease

Checking the Distributions of Categorical Variables

Analysing Relationship Between Gender and Presence of Cardiovascular Disease

Gender - Men are more likely than women to develop cardiovascular disease at an earlier age. Women are slightly more likely than men to develop cardiovascular disease as they get older.

Analysing Relationship Between Cholesterol and Presence of Cardiovascular Disease

In general, higher cholesterol levels lead to a higher probability of the presence of cardiovascular disease.

Analysing Relationship Between Glucose and Presence of Cardiovascular Disease

In general, higher glucose levels lead to higher probability of presence of cardiovascular disease.

Analysing Relationship Between Smoke and Presence of Cardiovascular Disease

It can be seen that there is no significant correlation between being a non-smoker or a smoker and the presence of cardiovascular disease.

We found that the smoke variable is very unbalanced, so we tried to balance it through undersampling.
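Undersampling here means downsampling the majority class to the minority-class size; a minimal sketch on a toy frame (the real notebook applies this to the full smoke column):

```python
import pandas as pd

# Toy imbalanced 'smoke' column: 8 non-smokers vs 2 smokers.
df = pd.DataFrame({"smoke": [0] * 8 + [1] * 2, "cardio": [0, 1] * 5})

minority = df[df["smoke"] == 1]
majority = df[df["smoke"] == 0].sample(len(minority), random_state=0)
balanced = pd.concat([majority, minority])
print(balanced["smoke"].value_counts().to_dict())
```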

We found that undersampling does not help to improve the reliability of this statistic either, so we decided to drop it.

Analysing Relationship Between Alcohol and Presence of Cardiovascular Disease

It can be seen that there is no significant correlation between being a non-drinker or a drinker and the presence of cardiovascular disease.

We found that undersampling made drinkers appear more at risk of cardiovascular disease than non-drinkers, but the relationship is still very weak, and we do not think this statistic is reliable, so we decided to drop it.

Analysing Relationship Between Active and Presence of Cardiovascular Disease

We observed that the general trend is that inactive people are more prone to cardiovascular disease than active people.

Compute pairwise correlation of variables

We can see that the variables most highly correlated with cardio are ap_hi, map, ap_lo and pp.

PCA Analysis

Principal component analysis (PCA) is an unsupervised technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. It tries to preserve the essential parts of the data that carry more variation and remove the non-essential parts with less variation.
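This idea can be sketched with scikit-learn; the random matrix below stands in for the 15 scaled features used in the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))          # stand-in for the 15 scaled features

# Project to 2 dimensions and inspect how much variance each axis explains.
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)
print(X_2d.shape, pca.explained_variance_ratio_)
```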

Principal component 1 holds 29.80% of the information while principal component 2 holds 14.10% of the information. Projecting the 15-dimensional data down to 2 dimensions therefore loses 56.1% of the information.

The variation between those who have cardiovascular disease and those who do not is quite small in the PCA analysis, but we are at least able to observe the slight shift in positions between the two classes in the plot above, and we found the features with higher importance (explained variance) for our classification objective.

Data Merge

We found this dataset because much research has shown that diabetes is a major risk factor for cardiovascular disease. We want to introduce a “diabetes” value into our cardio dataset. (After adding this variable, the accuracy of predictions on our main cardiovascular-disease dataset improved.)

Source of dataset: https://www.kaggle.com/mathchi/diabetes-data-set

Dealing with Illogical Values in Data

Now, all the blanks have been filled.

By comparing the features of the two datasets as well as their value types, we decided to use 'Age', 'BMI' and 'BloodPressure' from the diabetes dataset, which correspond to 'year', 'bmi' and 'map' in the cardio dataset.

Correlation with Outcome (Diabetes Positive):

  1. BMI : 0.32
  2. Age : 0.24
  3. BloodPressure : 0.17

Now, the mean values of year in the two datasets are 49 and 53, which are close. Besides, the range of year in our diabetes dataset covers that of our cardio dataset.

This shows that the target variable, Outcome, in the diabetes dataset is balanced.

Predict the Diabetes Outcome in Our Main Dataset
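The transfer step can be sketched as follows: train a classifier on the diabetes dataset's (Age, BMI, BloodPressure) and apply it to the cardio dataset's matching (year, bmi, map) columns. The data below is synthetic and the logistic-regression model is a placeholder for whichever classifier the notebook actually used:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-in for the diabetes dataset with a synthetic label.
diab = pd.DataFrame({"Age": rng.integers(21, 80, 300),
                     "BMI": rng.uniform(18, 45, 300),
                     "BloodPressure": rng.uniform(60, 110, 300)})
diab["Outcome"] = (diab["BMI"] > 30).astype(int)

# Synthetic stand-in for the cardio dataset's matched columns.
cardio = pd.DataFrame({"year": rng.integers(39, 65, 100),
                       "bmi": rng.uniform(18, 45, 100),
                       "map": rng.uniform(70, 110, 100)})

# Fit on diabetes features, predict a 'diabetes' column for the cardio set.
clf = LogisticRegression(max_iter=1000)
clf.fit(diab[["Age", "BMI", "BloodPressure"]].to_numpy(), diab["Outcome"])
cardio["diabetes"] = clf.predict(cardio[["year", "bmi", "map"]].to_numpy())
print(cardio["diabetes"].value_counts().to_dict())
```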

Model Training and Testing

For Model Training and Testing, we initialised and tuned 7 different models and executed a model ensembling technique to combine the decisions from our models and improve the overall performance.

Models tested out:

  1. Decision Tree Classifier
  2. Random Forest Classifier
  3. XGBoost
  4. K-Nearest Neighbours
  5. Support Vector Machine
  6. Neural Network

Data Preparation

Decision Tree Classifier

This is not good enough; we shall improve the decision tree classifier by tuning max_depth.
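A minimal sketch of the max_depth sweep on synthetic stand-in data (the real notebook runs this on the cardio features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cardio feature matrix.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one tree per candidate depth and keep the test accuracy of each.
scores = {d: DecisionTreeClassifier(max_depth=d, random_state=0)
             .fit(X_tr, y_tr).score(X_te, y_te)
          for d in range(1, 11)}
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```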

Random Forest Classifier

This model fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

We identified the max_depth that gave the best test classification accuracy.
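A minimal sketch of the fitted forest on synthetic stand-in data (max_depth=8 is an illustrative value, not the notebook's tuned one):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cardio feature matrix.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Average many depth-limited trees fitted on bootstrap sub-samples.
rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
print(round(acc, 3))
```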

XGBoost

XGBoost stands for eXtreme Gradient Boosting. It is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
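A runnable sketch of the gradient-boosting idea on synthetic data; scikit-learn's GradientBoostingClassifier is used here as a lightweight stand-in for xgboost.XGBClassifier (same boosting principle, different library), and the parameter values are illustrative, not the notebook's tuned set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cardio feature matrix.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new tree fits the residual errors of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)
acc = gb.score(X_te, y_te)
print(round(acc, 3))
```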

After a few trials, we obtained our best set of parameters below. As hyperopt samples parameters uniformly from our parameter space, the best parameter will be different for every run. To make our results reproducible, we provided our best parameters below.

Analysis of Decision Tree, Random Forest & XGBoost Classifiers

We can see that XGBoost gives us the best result of the 3 decision-tree-based models. This is expected, as XGBoost uses the gradient boosting method to minimise loss. This method is based on iterative learning: the model makes an initial prediction, analyses its mistakes, and gives more weight to the data points it predicted wrongly in the next iteration. This makes the process more systematic than the random sampling used for the decision trees in a random forest, and hence leads to better results. We will include only XGBoost as the representative of the tree-based models in our final ensembled model.

K-Nearest Neighbours

The K-Nearest Neighbours algorithm captures the similarity between the data point to be classified and its k nearest neighbours, determined by calculating the distance between them. Among the k nearest neighbours, a voting mechanism is applied and the majority class is assigned to the data point.

Image showing a data point (blue star) being classified as 'red circle' instead of 'green square'
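A minimal KNN sketch on synthetic stand-in data; features are scaled first because KNN is distance-based, and k=5 is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cardio feature matrix.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale features so no single feature dominates the distance metric.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
acc = knn.score(scaler.transform(X_te), y_te)
print(round(acc, 3))
```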

Support Vector Machine

The Support Vector Machine algorithm finds a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points. A good hyperplane has large margins to facilitate easier classification of future points.
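A minimal SVM sketch on synthetic stand-in data; the RBF kernel and C=1.0 are illustrative defaults, not the notebook's tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the cardio feature matrix.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF kernel lets the separating hyperplane live in a non-linear feature space.
svm = SVC(kernel="rbf", C=1.0, random_state=0).fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
print(round(acc, 3))
```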

Neural Network

A neural network is inspired by the biological brain, simulating neurons connecting to one another to form a complex network. A deep neural network can be represented as a hierarchical (layered) organization of neurons with connections to other neurons. These neurons pass a message or signal to other neurons based on the received input, forming a complex network that learns with some feedback mechanism.

Dense Layers: Used for changing the dimension of the vectors, where all neurons of this layer are connected to all neurons of the preceding layer.

Activation Layers: Introduce non-linearity into the network so that it can learn the relationship between the input and output values.

Dropout layers: Prevents overfitting by randomly ignoring or “dropping out” some number of layer outputs.
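A lighter runnable sketch of the dense-layer idea: scikit-learn's MLPClassifier is used here as a stand-in for the notebook's deep-learning framework, so it covers the dense and activation layers but has no dropout; layer sizes are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the cardio feature matrix.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two dense hidden layers (32 and 16 neurons) with ReLU activations.
nn = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                   max_iter=500, random_state=0).fit(X_tr, y_tr)
acc = nn.score(X_te, y_te)
print(round(acc, 3))
```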

Model Ensembling - Blending

Steps of Blending

  1. The test set is further split into a validation set and a final test set.
  2. The base models' predictions on the validation set are used as features to build a new model.
  3. This new model is used to make final predictions on the final test set.

Scikit-learn's logistic regression uses the 'one-versus-rest' approach when the predictions are binary (which is our case): one model is created for each class, and the final class is determined through the argmax of the probabilities of each class.
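The blending steps above can be sketched as follows; the two base models here are placeholders for the notebook's tuned models, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, split into train / validation / final test.
X, y = make_classification(n_samples=900, n_features=10, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4,
                                              random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest,
                                                test_size=0.5, random_state=0)

# Fit the base models on the training split.
base = [RandomForestClassifier(random_state=0), KNeighborsClassifier()]
for m in base:
    m.fit(X_tr, y_tr)

# Base-model probabilities on the validation split become meta-features.
meta_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base])
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base])

# The blender learns how to weight the base models' opinions.
blender = LogisticRegression().fit(meta_val, y_val)
acc = blender.score(meta_test, y_test)
print(round(acc, 3))
```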

The ensembled model gave only a small advantage over the individual XGBoost model, but it may be improved with models that use more differentiated classification methods.

Real-time model prediction

Required inputs from user:

According to our models, we are able to determine the overall ranking of features from most important to least important as shown below:

  1. ap_hi
  2. year
  3. chol
  4. bmi

Since we are not able to advise on our user's age, we will advise on the rest of the variables accordingly. For bmi, since weight is a more controllable factor than height, we will advise the user on weight.
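The prediction step of such an application can be sketched as follows; the model, the 4-feature layout (mirroring the ranking above: ap_hi, year, chol, bmi), and the scaled user values are all hypothetical stand-ins for the notebook's ensemble:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 4-feature model standing in for the trained ensemble.
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Illustrative scaled user inputs in the order ap_hi, year, chol, bmi.
user = np.array([[0.5, -1.0, 0.2, 1.3]])
prob = model.predict_proba(user)[0, 1]   # probability of the positive class
print(round(float(prob), 3))
```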

Advice Websites